Do We Need More Training Data?


Xiangxin Zhu • Carl Vondrick • Charless C. Fowlkes • Deva Ramanan 


Abstract Datasets for training object recognition systems are
steadily increasing in size. This paper investigates the question
of whether existing detectors will continue to improve as data
grows, or saturate in performance due to limited model complexity
and the Bayes risk associated with the feature spaces in which they
operate. We focus on the popular paradigm of discriminatively
trained templates defined on oriented gradient features. We
investigate the performance of mixtures of templates as the number
of mixture components and the amount of training data grow.
Surprisingly, even with proper treatment of regularization and
“outliers”, the performance of classic mixture models appears to
saturate quickly (~10 templates and ~100 positive training examples
per template). This is not a limitation of the feature space, as
compositional mixtures that share template parameters via parts and
that can synthesize new templates not encountered during training
yield significantly better performance. Based on our analysis, we
conjecture that the greatest gains in detection performance will
continue to derive from improved representations and learning
algorithms that can make efficient use of large datasets.
1 Introduction


Much of the impressive progress in object detection is built on the
methodologies of statistical machine learning, which make use of
large training datasets to tune model parameters. Consider the
benchmark results of the well-known PASCAL VOC object challenge
(Fig. 1). There is a clear trend of increased benchmark performance
over the years as new methods have been developed. However, this
improvement is also correlated with increasing amounts of training
data. One might be tempted to simply view this trend as another
case of the so-called “effectiveness of big-data”, which posits
that even very complex problems in artificial intelligence may be
solved by simple statistical models trained on massive datasets
(Halevy et al 2009). This leads us to consider a basic question
about the field: will continually increasing amounts of training
data be sufficient to drive continued progress in object
recognition absent the development of more complex object detection
models?

To tackle this question, we collected a massive training set that
is an order of magnitude larger than existing collections such as
PASCAL (Everingham et al 2010). We follow the dominant paradigm of
scanning-window templates trained with linear SVMs on HOG features
(Dalal and Triggs 2005; Felzenszwalb et al 2010; Bourdev and Malik
2009; Malisiewicz et al 2011), and evaluate detection performance
as a function of the amount of training data and the model
complexity.

Challenges: We found there is a surprising amount of subtlety in
scaling up training data sets in current systems. For a fixed
model, one would expect performance to generally increase with the
amount of data and eventually saturate (Fig. 2). Empirically, we
often saw the bizarre result that off-the-shelf implementations
show decreased performance with additional data! One would also
expect that to take advantage of additional training data, it is
necessary to grow the model complexity, in this case by adding
mixture components to capture different object sub-categories and
viewpoints. However, even with non-parametric models that grow with
the amount of training data, we quickly encountered diminishing
returns in performance with only modest amounts of training data.

Fig. 1 The best reported performance on the PASCAL VOC challenge
has shown marked increases since 2006 (top). This could be due to
various factors: the dataset itself has evolved over time, the
best-performing methods differ across years, etc. In the bottom
row, we plot a particular factor, training data size, which appears
to correlate well with performance. This raises the question: has
the increase been largely driven by the availability of larger
training sets?

We show that the apparent performance ceiling is not a consequence
of HOG+linear classifiers. We provide an analysis of the popular
deformable part model (DPM), showing that it can be viewed as an
efficient way to implicitly encode and score an exponentially-large
set of rigid mixture components with shared parameters. With the
appropriate sharing, DPMs produce substantial performance gains
over standard non-parametric mixture models. However, DPMs have
fixed complexity and still saturate in performance with current
amounts of training data, even when scaled to mixtures of DPMs.
This difficulty is further exacerbated by the computational demands
of non-parametric mixture models, which can be impractical for many
applications.


Proposed solutions: In this paper, we offer explanations and
solutions for many of these difficulties. First, we found it
crucial to set model regularization as a function of training
dataset size using cross-validation, a standard technique which is
often overlooked in current object detection systems. Second,
existing strategies for discovering sub-category structure, such as
clustering aspect ratios (Felzenszwalb et al 2010), appearance
features (Divvala et al 2012), and keypoint labels (Bourdev and
Malik 2009) may not suffice. We found this was related to the
inability of classifiers to deal with “polluted” data when mixture
labels were improperly assigned. Increasing model complexity is
thus only useful when mixture components capture the “right”
sub-category structure.

Fig. 2 We plot idealized curves of performance versus training
dataset size and model complexity. The effect of additional
training examples is diminished as the training dataset grows
(left), while we expect performance to grow with model complexity
up to a point, after which an overly-flexible model overfits the
training dataset (right). Both these notions can be made precise
with learning theory bounds, see e.g. (McAllester 1999).

To efficiently take advantage of additional training data, we
introduce a non-parametric extension of a DPM which we call an
exemplar deformable part model (EDPM). Notably, EDPMs increase the
expressive power of DPMs with only a negligible increase in
computation, making them practically useful. We provide evidence
that suggests that compositional representations of mixture
templates provide an effective way to help target the “long-tail”
of object appearances by sharing local part appearance parameters
across templates.

Extrapolating beyond our experiments, we see the striking
difference between classic mixture models and the non-parametric
compositional model (both mixtures of linear classifiers operating
on the same feature space) as evidence that the greatest gains in
the near future will not be had with simple models+bigger data, but
rather through improved representations and learning algorithms.

We introduce our large-scale dataset in Sec. 2, describe our
non-parametric mixture models in Sec. 3, present extensive
experimental results in Sec. 4, and conclude with a discussion,
including related work, in the final section.


2 Big Detection Datasets 

Throughout the paper we carry out experiments using two datasets.
We vary the number of positive training examples, but in all cases
keep the number of negative training images fixed. We found that
performance was relatively static with respect to the amount of
negative training data, once a sufficiently large negative training
set was used.

PASCAL: Our first dataset is a newly collected data set that we
refer to as PASCAL-10X and describe in detail in the following
section. This dataset covers the 11 PASCAL categories (see Fig. )
and includes approximately 10 times as many training examples per
category as the standard training data provided by the PASCAL
detection challenge, allowing us to explore the potential gains of
larger numbers of positive training instances. We evaluate
detection accuracy on the 11 PASCAL categories from the PASCAL 2010
trainval dataset (because test annotations are not public), which
contains 10000+ images.

Faces: In addition to examining performance on PASCAL object
categories, we also trained models for face detection. We found
faces to contain more structured appearance variation, which often
allowed for more easily interpretable diagnostic experiments. Face
models are trained using the CMU MultiPIE dataset (Gross et al
2010), a well-known benchmark dataset of faces spanning multiple
viewpoints, illumination conditions, and expressions. We use up to
900 faces across 13 viewpoints. The viewpoints are spaced 15° apart
and span 180°. 300 of the faces are frontal, while the remaining
600 are evenly distributed among the remaining viewpoints. For
negatives, we use 1218 images from the INRIAPerson database (Dalal
and Triggs 2005). Detection accuracy of face models is evaluated on
the annotated faces in-the-wild (AFW) test set (Zhu and Ramanan
2012), which contains images from real-world environments that tend
to have cluttered backgrounds with large variations in both face
viewpoint and appearance.


2.1 Collecting PASCAL-10X

In this section, we describe our procedure for building a large,
annotated dataset that is as similar as possible to PASCAL 2010 for
object detection. We collected images from Flickr and annotations
from Amazon Mechanical Turk (MTurk), resulting in the data set
summarized in Tab. 1. We built training sets for 11 of the PASCAL
VOC categories that are an order of magnitude larger than the VOC
2010 standard trainval set. We selected these classes as they
contain the fewest training examples, and so are most likely to
benefit from additional training data. We took care to ensure
high-quality bounding box annotations and high similarity to the
PASCAL 2010 dataset. To our knowledge, this is the largest publicly
available positive training set for these PASCAL categories.

The dataset can be downloaded from
http://vision.ics.uci.edu/datasets/


                  PASCAL 2010           Our Data Set
Category        Images   Objects      Images   Objects

Bicycle            471       614       5,027     7,401
Bus                353       498       3,405     4,919
Cat              1,005     1,132      12,204    13,998
Cow                248       464       3,194     6,909
Dining Table       415       468       3,905     5,651
Horse              425       621       4,086     6,488
Motorbike          453       611       5,674     8,666
Sheep              290       701       2,351     6,018
Sofa               406       451       4,018     5,569
Train              453       524       6,403     7,648
TV Monitor         490       683       5,053     7,808

Totals           4,609     6,167      50,772    81,075

Table 1 PASCAL 2010 trainval and our data set for select
categories. Our data set is an order of magnitude larger.


                              PASCAL
Attributes          Us      2010    2007

Truncated          30.8     31.5    15.8
Occluded            5.9      8.6     7.1
Jumping             4.0      4.3    15.8
Standing           69.9     68.8    54.6
Trotting           23.5     24.9    26.6
Sitting             2.0      1.4     0.7
Other               0.0      0.5     0
Person Top         24.8     29.1    57.5
Person Besides      8.8     10.0     8.6
No Person          66.0     59.8    33.8

Table 2 Frequencies of attributes (percent) across images in our
10X horse data set compared to the PASCAL 2010 trainval data set.
Bolded entries highlight significant differences relative to our
collected data. Our dataset has a similar attribute distribution to
PASCAL 2010, but differs significantly from 2007, which has many
more sporting events.

Collection: We downloaded over one hundred thousand large images
from Flickr to build our dataset. We took care to directly mimic
the collection procedure used by the PASCAL organizers. We began
with a set of keywords (provided by the organizers) associated with
each object class. For each class, we picked a random keyword,
chose a random date since Flickr’s launch, selected a random page
of the results, and finally took a random image from that page. We
repeated this procedure until we had downloaded an order of
magnitude more images for each class.

Filtering: The downloaded images from Flickr did not necessarily
contain objects of the category that we were targeting. We created
MTurk tasks that asked workers to classify the downloaded images
according to whether they contained the category of interest. Our
user interface in Fig. 3 gave workers instructions on how to handle
special cases, and this resulted in acceptable annotation quality
without requiring agreement between workers.
[Classification interface]

When to mark yes:

• if the image contains at least one horse.
• if the horses are clearly visible through glass.
• if the horses are in a mirror.
• if the horses are in poor lighting, but still visible.
• if the image has a picture of horses as long as they are
  realistic.

When to mark no:

• if all the horses are toys or photoshopped.
• if the image is about horses but does not contain a horse.
• horse is very tiny.
• if the image is taken inside a horse,
• if the image is poor quality or has bad motion blur.
• if the image is a collage or multiple images.
• if you don't know what the image contains.

The images are below (they may take a second to load):

[Annotation interface]

• Draw a box around each individual horse in this image.
• If there are more than 5 horses, then label the 5 largest.
• You must read the instructions and examples as we hand review
  all work

[Checkbox] Image does not contain horses

[Start Over] Your work will directly impact active research.
[Submit HIT]


Fig. 3 Our MTurk user interfaces for image classification and
object annotation. We provided detailed instructions to workers,
resulting in acceptable annotation quality.


Annotation: After filtering the images, we created MTurk tasks
instructing workers to draw bounding boxes around a specific class.
Workers were only asked to annotate up to five objects per image
using our interface as in Fig. 3, although many workers gave us
more boxes. On average, our system received annotations at three
images per second, allowing us to build bounding boxes for 10,000
images in under an hour. As not every object is labeled, our data
set cannot be used to perform detection benchmarking (it is not
possible to distinguish false-positives from true-negatives). We
experimented with additional validation steps, but found they were
not necessary to obtain high-quality annotations.


2.2 Data Quality 

To verify the quality of our annotations, we performed an in-depth
diagnostic analysis of a particular category (horses). Overall, our
analysis suggests that our collection and annotation pipeline
produces high-quality training data that is similar to PASCAL.

Attribute distribution: We first compared various distributions of
attributes of bounding boxes from PASCAL-10X to those from both
PASCAL 2010 and 2007 trainval. Attribute annotations were provided
by manual labeling. Our findings are summarized in Tab. 2.
Interestingly, horses collected in 2010 and 2007 vary
significantly, while 2010 and PASCAL-10X match fairly well. Our
images were on average twice the resolution of those in PASCAL, so
we scaled our images down to construct our final dataset.

User assessment: We also gauged the quality of our bounding boxes
compared to PASCAL with a user study. We flashed a pair of horse
bounding boxes, one from PASCAL-10X and one from PASCAL 2010, on a
screen and instructed a subject to label which appeared to be the
better example. Our subject preferred the PASCAL 2010 data set 49%
of the time and our data set 51% of the time. Since chance is
50%-50% and our subject operated close to chance, this further
suggests PASCAL-10X matched well with PASCAL. Qualitatively, the
biggest difference observed between the two datasets was that
PASCAL-10X bounding boxes tend to be somewhat “looser” than the
(hand curated) PASCAL 2010 data.

Redundant annotations: We tested the use of multiple annotations
for removing poorly labeled positive examples. All horse images
were labeled twice, and only those bounding boxes that agreed
across the two annotation sessions were kept for training. We found
that training on these cross-verified annotations did not
significantly affect the performance of the learned detector.


3 Mixture models 


To take full advantage of additional training data, it 
is vital to grow model complexity. We accomplish this 
by adding a mixture component to capture additional 
“sub-category” structure. In this section, we describe 
various approaches for learning and representing mixture 
models. Our basic building block will be a mixture of 
linear classifiers, or templates. Formally speaking, we 
compute the detection score of an image window I as: 


$$S(I) = \max_m \left[ w_m \cdot \phi(I) + b_m \right] \tag{1}$$
(b) Supervised


Fig. 4 We compare supervised versus automatic (k-means) approaches
for clustering by displaying the average RGB image of each cluster.
The supervised methods use viewpoint labels to cluster the training
data. Because our face data is relatively clean, both obtain
reasonably good clusters. However, at some levels of the hierarchy,
unsupervised clustering does seem to produce suboptimal partitions;
for example, at K = 2 there is no natural way to group multi-view
faces into two groups. Automatically selecting K is a key
difficulty with unsupervised clustering algorithms.


where $m$ is a discrete mixture variable, $\phi(I)$ is a HOG image
descriptor (Dalal and Triggs 2005), $w_m$ is a linearly-scored
template, and $b_m$ is an (optional) bias parameter that acts as a
prior that favors particular templates over others.
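
As a minimal sketch of Eqn. 1, the following scores a window by
taking the best-scoring template. The descriptor is a plain list of
numbers standing in for a HOG vector, and the function name and toy
values are illustrative assumptions rather than part of any released
code.

```python
def mixture_score(phi, templates, biases):
    """Return (score, component) for S(I) = max_m [w_m . phi(I) + b_m]."""
    best_score, best_m = float("-inf"), None
    for m, (w, b) in enumerate(zip(templates, biases)):
        # Linear template response plus the per-component bias.
        score = sum(wi * xi for wi, xi in zip(w, phi)) + b
        if score > best_score:
            best_score, best_m = score, m
    return best_score, best_m


# Toy 3-dimensional "descriptor" and two templates (hypothetical values).
phi = [1.0, 0.0, 2.0]
templates = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
biases = [0.5, 0.0]
score, component = mixture_score(phi, templates, biases)
```

The index of the winning component is returned alongside the score
because it identifies which sub-category (for example, which
viewpoint) explained the window.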


3.1 Independent mixtures 

In this section, we describe approaches for learning mixture models
by clustering positive examples from our training set. We train
independent linear classifiers $(w_m, b_m)$ using positive examples
from each cluster. One difficulty in evaluating mixture models is
that fluctuations in the (non-convex) clustering results may mask
variations in performance we wish to measure. We took care to
devise a procedure for varying K (the number of clusters) and N
(the amount of training data) in such a manner that would reduce
stochastic effects of random sampling.

(b) Supervised


Fig. 5 We compare supervised versus automatic (k-means) approaches
for clustering images of PASCAL buses. Supervised clustering
produces clearer clusters, e.g. the 21 supervised clusters
correspond to viewpoints and object type (single vs double-decker).
Supervised clusters perform better in practice, as we show in
Fig. m

Unsupervised clustering: For our unsupervised baseline, we cluster
the positive training images of each category into 16 clusters
using hierarchical k-means, recursively splitting each cluster into
k = 2 subclusters. For example, given a fixed training set, we
would like the cluster partitions for K = 8 to respect the cluster
partition of K = 4. To capture both appearance and shape when
clustering, we warp an instance to a canonical aspect ratio,
compute its HOG descriptor (reducing the dimensionality with PCA
for computational efficiency), and append the aspect ratio to the
resulting feature vector.
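
The recursive splitting can be sketched as follows. This is an
illustrative implementation with plain Lloyd iterations on toy
feature vectors; the function names, the iteration count and the
handling of degenerate splits are assumptions, and real features
would be the PCA-reduced HOG descriptors with the aspect ratio
appended.

```python
import random


def two_means(points, rng, iters=20):
    """Split a list of feature vectors into two groups (k-means, k = 2)."""
    centers = rng.sample(points, 2)
    groups = [list(points), []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            d = [sum((a - c) ** 2 for a, c in zip(p, ctr)) for ctr in centers]
            groups[0 if d[0] <= d[1] else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. duplicate points
        centers = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return groups


def hierarchical_kmeans(points, depth, rng):
    """levels[d] holds the clusters after d rounds of splitting (<= 2**d)."""
    levels = [[list(points)]]
    for _ in range(depth):
        nxt = []
        for cluster in levels[-1]:
            if len(cluster) < 2:
                nxt.append(cluster)  # nothing left to split
                continue
            nxt.extend(g for g in two_means(cluster, rng) if g)
        levels.append(nxt)
    return levels
```

Because every cluster at one level is obtained by splitting a single
cluster of the level above, the partition for K = 8 automatically
refines the partition for K = 4; four rounds of splitting give up to
16 clusters.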

Partitioned sampling: Given a fixed training set of $N_{max}$
positive images, we would like to construct a smaller sampled
subset, say of $N$ images, whose cluster partitions respect those
in the full dataset. This is similar in spirit to stratified
sampling and attempts to reduce variance in our performance
estimates due to “binning artifacts” of inconsistent cluster
partitions across re-samplings of the data.

To do this, we first hierarchically partition the full set of
$N_{max}$ images by recursively applying k-means. We then subsample
the images in the leaf nodes of the hierarchy in order to generate
a smaller hierarchically partitioned dataset by using the same
hierarchical tree defined over the original leaf clusters. This
sub-sampling procedure can be applied repeatedly to produce
training datasets with fewer and fewer examples that still respect
the original data distribution and clustering.

Input: $\{N_n\}$; $\{S^{(i)}\}$
Output: $\{C_n^{(i)}\}$

1 $C_0^{(i)} = S^{(i)}$; $C_n^{(i)} = \emptyset \quad \forall i, \forall n \ge 1$
2 for n = 1 : end do                      // For each $N_n$
3     for t = 1 : $N_n$ do
4         $z \sim |C_{n-1}^{(i)}| / \sum_j |C_{n-1}^{(j)}|$      // Pick a cluster randomly
5         $C_n^{(z)} \Leftarrow C_{n-1}^{(z)}$ without replacement      // Sample the z-th cluster
6     end
7 end

Algorithm 1: Partitioned sampling of the clusters. $N_n$ is the
number of samples to return for set n with $N_0 = N_{max}$;
$N_n > N_{n+1}$. $S^{(i)}$ is the i-th cluster from the lowest
level of the hierarchy (e.g., with K = 16 clusters) computed on the
full dataset $N_{max}$. Steps 4-5 randomly sample $N_n$ training
samples from $\{C_{n-1}^{(i)}\}$ to construct K sub-sampled
clusters $\{C_n^{(i)}\}$, each of which contains a subset of the
training data while keeping the same distribution of the data over
clusters.

The sampling algorithm, shown in Alg. 1, yields a set of
partitioned training sets, indexed by (K, N), with two properties:
(1) for a fixed number of clusters K, each smaller training set is
a subset of the larger ones, and (2) given a fixed training set
size N, small clusters are strict refinements of larger clusters.
We compute confidence intervals in our experiments by repeating
this procedure multiple times to resample the dataset and produce
multiple sets of (K, N)-consistent partitions.
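
The nested subsampling can be sketched in a few lines. This is an
illustration under stated assumptions, not the authors' code: the
function name is hypothetical, and a cluster is picked with
probability proportional to the number of items still available in
it, which is one way to sample without replacement while preserving
the distribution over clusters in expectation.

```python
import random


def partitioned_sampling(leaf_clusters, sizes, rng):
    """Nested subsampling of leaf clusters.

    leaf_clusters: list of lists, the clusters computed on the full data.
    sizes: decreasing subset sizes N_1 > N_2 > ... (each at most the
           number of items available at the previous level).
    Returns one list of subsampled clusters per requested size.
    """
    levels = []
    previous = [list(c) for c in leaf_clusters]
    for n_samples in sizes:
        pool = [list(c) for c in previous]  # items still available
        current = [[] for _ in pool]
        for _ in range(n_samples):
            # Pick a cluster in proportion to what is left in it.
            weights = [len(c) for c in pool]
            i = rng.choices(range(len(pool)), weights=weights)[0]
            # Move one random item out of that cluster (no replacement).
            current[i].append(pool[i].pop(rng.randrange(len(pool[i]))))
        levels.append(current)
        previous = current  # the next, smaller set is drawn from this one
    return levels
```

Each level is drawn from the previous one, which gives property (1);
property (2) follows from merging the leaf clusters of any level
along the fixed hierarchical tree.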

Supervised clustering: To examine the effect of supervision, we
cluster the training data by manually grouping visually similar
samples. For CMU MultiPIE, we define clusters using viewpoint
annotations provided with the dataset. We generate a hierarchical
clustering by having a human operator merge similar viewpoints,
following the partitioned sampling scheme above. Since PASCAL-10X
does not have viewpoint labels, we generate an “over-clustering”
with k-means with a large K, and have a human operator manually
merge clusters. Fig. 4 and Fig. 5 show example clusters for faces
and buses.


3.2 Compositional mixtures 

In this section, we describe various architectures for
compositional mixture models that share information between mixture
components. We share local spatial regions of templates, or parts.
We begin our discussion by reviewing standard architectures for
deformable part models (DPMs), and show how they can be interpreted
and extended as high-capacity mixture models.

Deformable Part Models (DPMs): We begin with an analysis that shows
that DPMs are equivalent to an exponentially-large mixture of rigid
templates (Eqn. 1). This allows us to analyze (both theoretically
and empirically) under what conditions a classic mixture model will
approach the behavior of a DPM. Let the location of part i be
$(x_i, y_i)$. Given an image $I$, a DPM scores a configuration of P
parts $(x, y) = \{(x_i, y_i) : i = 1..P\}$ as:

$$S_{DPM}(I) = \max_{x,y} S(I, x, y) \quad \text{where}$$

$$S(I, x, y) = \sum_{i=1}^{P} \sum_{(u,v) \in W_i} \alpha_i[u, v] \cdot \phi(I, x_i + u, y_i + v) + \sum_{ij \in E} \beta_{ij} \cdot \psi(x_i - x_j - a_{ij}^{(x)}, y_i - y_j - a_{ij}^{(y)}) \tag{2}$$

where $W_i$ defines the spatial extent (length and width) of part
i. The first term defines a local appearance score, where
$\alpha_i$ is the appearance template for part i and
$\phi(I, x_i, y_i)$ is the appearance feature vector extracted from
location $(x_i, y_i)$. The second term defines a pairwise
deformation model that scores the relative placement of a pair of
parts with respect to an anchor position
$(a_{ij}^{(x)}, a_{ij}^{(y)})$. For simplicity, we have assumed all
filters are defined at the same scale, though the above can be
extended to the multi-scale case. When the associated relational
graph $G = (V, E)$ is tree-structured, one can compute the
best-scoring part configuration $\max_{(x,y) \in \Omega} S(I, x, y)$
with dynamic programming, where $\Omega$ is the space of possible
part placements. Given that each of P parts can be placed at one of
L locations, $|\Omega| = L^P$, which is exponentially large for our
models.

By defining index variables in image coordinates $u' = x_i + u$ and
$v' = y_i + v$, we can rewrite Eqn. 2 as:

$$
\begin{aligned}
S(I, x, y) &= \sum_{u',v'} \sum_{i=1}^{P} \alpha_i[u' - x_i, v' - y_i] \cdot \phi(I, u', v') + \sum_{ij \in E} \beta_{ij} \cdot \psi(x_i - x_j - a_{ij}^{(x)}, y_i - y_j - a_{ij}^{(y)}) \\
&= \sum_{u',v'} w(x, y)[u', v'] \cdot \phi(I, u', v') + b(x, y) \\
&= w(x, y) \cdot \phi(I) + b(x, y)
\end{aligned}
\tag{3}
$$

where $w(x, y)[u', v'] = \sum_{i=1}^{P} \alpha_i[u' - x_i, v' - y_i]$.
For notational convenience, we assume part templates are padded
with zeros outside of their default spatial extent.

From the above expression it is easy to see that the DPM scoring
function is formally equivalent to an exponentially-large mixture
model where each mixture component m is indexed by a particular
configuration of parts $(x, y)$. The template corresponding to each
mixture component $w(x, y)$ is constructed by adding together parts
at shifted locations. The bias corresponding to each mixture
component $b(x, y)$ is equivalent to the spatial deformation score
for that configuration of parts.
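
The equivalence can be checked numerically on a toy example. The
sketch below works in one dimension with scalar features, which is a
simplifying assumption made only for brevity; the function names are
illustrative. It builds the rigid template for one part
configuration by summing zero-padded, shifted part filters and
compares its response with the sum of part responses (the
deformation term, which becomes the bias, is omitted).

```python
def part_score(phi, parts, placement):
    """Appearance term of Eqn. 2 in 1-D: sum_i sum_u alpha_i[u] * phi[x_i + u]."""
    return sum(a * phi[x + u]
               for alpha, x in zip(parts, placement)
               for u, a in enumerate(alpha))


def synthesize_template(length, parts, placement):
    """Rigid template w(x)[u'] = sum_i alpha_i[u' - x_i], zero outside each part."""
    w = [0.0] * length
    for alpha, x in zip(parts, placement):
        for u, a in enumerate(alpha):
            w[x + u] += a
    return w


# Toy feature map, two part filters and one placement (hypothetical values).
phi = [0.5, 1.0, -1.0, 2.0, 0.0, 3.0]
parts = [[1.0, 2.0], [0.5, -1.0, 1.0]]
placement = [1, 2]

w = synthesize_template(len(phi), parts, placement)
rigid = sum(wi * xi for wi, xi in zip(w, phi))
assert abs(rigid - part_score(phi, parts, placement)) < 1e-12
```

Changing `placement` yields a different rigid template built from
the same part filters, which is the sense in which all templates of
the implicit mixture share parameters.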

DPMs differ from classic mixture models previously defined in that
they (1) share parameters across a large number of mixtures or
rigid templates, (2) extrapolate by “synthesizing” new templates
not encountered during training, and finally, (3) use dynamic
programming to efficiently search over a large number of templates.

Exemplar Part Models (EPMs): To analyze the relative importance of
part parameter sharing and extrapolation to new part placements, we
define a part model that limits the possible configurations of
parts to those seen in the N training images, written as

$$S_{EPM}(I) = \max_{(x,y) \in \Omega_N} S(I, x, y) \quad \text{where} \quad \Omega_N \subseteq \Omega \tag{4}$$

We call such a model an Exemplar Part Model (EPM), since it can
also be interpreted as a set of N rigid exemplars with shared
parameters. EPMs are not to be confused with exemplar DPMs (EDPMs),
which we will shortly introduce as their deformable counterpart.
EPMs can be optimized with a discrete enumeration over N rigid
templates rather than dynamic programming. However, by caching
scores of the local parts, this enumeration can be made quite
efficient even for large N. EPMs have the benefit of sharing, but
cannot synthesize new templates that were not present in the
training data. We visualize example EPM templates in Fig. 6.
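
The enumeration with cached part scores can be sketched as follows,
again in one dimension with scalar features for brevity. The
function names and the optional `bias` callback (standing in for the
deformation score of a configuration) are assumptions made for
illustration.

```python
def response_map(phi, alpha):
    """Cache the response of one part filter at every valid location."""
    n = len(phi) - len(alpha) + 1
    return [sum(a * phi[x + u] for u, a in enumerate(alpha)) for x in range(n)]


def epm_score(phi, parts, exemplar_configs, bias=lambda cfg: 0.0):
    """Best score over the part configurations seen in training."""
    maps = [response_map(phi, alpha) for alpha in parts]  # computed once
    scores = [sum(maps[i][x] for i, x in enumerate(cfg)) + bias(cfg)
              for cfg in exemplar_configs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], exemplar_configs[best]


# Toy data: each configuration lists one location per part.
phi = [0.5, 1.0, -1.0, 2.0, 0.0, 3.0]
parts = [[1.0, 2.0], [0.5, -1.0, 1.0]]
configs = [(1, 2), (0, 1), (4, 3)]
best_score, best_config = epm_score(phi, parts, configs)
```

Scoring each exemplar costs one table lookup per part, so the cost
of the filter responses is paid once regardless of N.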

To take advantage of additional training data, we would like to
explore non-parametric mixtures of DPMs. One practical issue is
that of computation. We show that with a particular form of
sharing, one can construct non-parametric DPMs that are no more
computationally complex than standard DPMs or EPMs, but
considerably more flexible in that they extrapolate multi-modal
shape models to unseen configurations.

Exemplar DPMs (EDPMs): To describe our model, we first define a
mixture of DPMs with a shared appearance model, but
mixture-specific shape models. In the extreme case, each mixture
will consist of a single training exemplar. We describe an approach
that shares both the part filter computations and dynamic
programming messages across all mixtures, allowing us to eliminate
almost all of the mixture-dependent computation. Specifically, we
consider mixtures of DPMs of the form:

$$S(I) = \max_m \max_{z \in \Omega} \left[ w(z) \cdot \phi(I) + b_m(z) \right] \tag{5}$$

where $z = (x, y)$ and we write a DPM as an inner maximization over
an exponentially-large set of templates indexed by $z \in \Omega$,
as in Eqn. 3. Because the appearance model does not depend on m, we
can write:

$$S(I) = \max_{z \in \Omega} \left[ w(z) \cdot \phi(I) + b(z) \right] \tag{6}$$

where $b(z) = \max_m b_m(z)$. Interestingly, we can write the DPM,
EPM, and EDPM in the form of Eqn. 6 by simply changing the shape
model $b(z)$:

$$b_{DPM}(z) = \sum_{ij \in E} \beta_{ij} \cdot \psi(z_i - z_j - a_{ij}) \tag{7}$$

$$b_{EDPM}(z) = \max_m \sum_{ij \in E} \beta_{ij} \cdot \psi(z_i - z_j - a_{ij}^m) \tag{8}$$

$$b_{EPM}(z) = b_{DPM}(z) + b^*_{EDPM}(z) \tag{9}$$

where $a_{ij}^m$ is the anchor position for part i and j in mixture
m. We write $b^*_{EDPM}(z)$ to denote a limiting case of
$b_{EDPM}(z)$ with $\beta_{ij} = -\infty$, which thus takes on a
value of 0 when z has the same relative part locations as some
exemplar m and $-\infty$ otherwise.

While the EPM only considers M different part configurations to
occur at test time, the EDPM extrapolates away from these shape
exemplars. The spring parameters $\beta$ in the EDPM thus play a
role similar to the kernel width in kernel density estimation. We
show a visualization of these shape models as probabilistic priors
in Fig. 7.

Fig. 7 We visualize exponentiated shape models corresponding to
different part models. A DPM uses a unimodal Gaussian-like model
(left), while an EPM allows for only a discrete set of shape
configurations encountered at training (middle). An EDPM
non-parametrically models an arbitrary shape function using a small
set of basis functions. From this perspective, one can view EPMs as
special cases of EDPMs using scaled delta functions as basis
functions.
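
To make the three shape models concrete, here is a small sketch for
one-dimensional part locations. It assumes a quadratic spring,
$\beta \cdot \psi(d) = -\beta d^2$ with a single positive stiffness
`beta`; this specific form, the function names and the toy anchors
are illustrative assumptions.

```python
NEG_INF = float("-inf")


def b_dpm(z, edges, anchors, beta):
    """Eqn. 7: a single set of anchors with quadratic springs."""
    return sum(-beta * (z[i] - z[j] - anchors[(i, j)]) ** 2 for i, j in edges)


def b_edpm(z, edges, exemplar_anchors, beta):
    """Eqn. 8: best spring score over the exemplars' anchor sets."""
    return max(b_dpm(z, edges, a, beta) for a in exemplar_anchors)


def b_edpm_limit(z, edges, exemplar_anchors):
    """Limiting case used in Eqn. 9: 0 if z matches an exemplar, else -inf."""
    for a in exemplar_anchors:
        if all(z[i] - z[j] == a[(i, j)] for i, j in edges):
            return 0.0
    return NEG_INF


edges = [(0, 1)]
exemplars = [{(0, 1): -2}, {(0, 1): 3}]  # two exemplar anchor sets
on_exemplar = [5, 7]    # z_0 - z_1 = -2, matches the first exemplar
off_exemplar = [5, 6]   # z_0 - z_1 = -1, one cell away from it
```

With these values the EDPM shape score is 0 on the exemplar and
decays smoothly away from it, while the limiting case rejects every
configuration that was not seen in training. Increasing `beta`
narrows each bump, which is the sense in which the spring parameters
act like a kernel width.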
[Figure panels: Independent exemplars; EPM; DPM detection]

Fig. 6 Classic exemplars vs EPMs. On the top row, we show three rigid templates trained as independent exemplar mixtures.
Below them, we show their counterparts from an exemplar part model (EPM), along with their corresponding training images.
EPMs share spatially-localized regions (or “parts”) between mixtures. Each rigid mixture is a superposition of overlapping
parts. A single part is drawn in blue. We show parts on the top row to emphasize that these template regions are trained
independently. On the right, we show a template which is implicitly synthesized by a DPM for a novel test image on-the-fly. In
Fig. 15 we show that both sharing of parameters between mixture components and implicit generation of mixture components
corresponding to unseen part configurations contribute to the strong performance of a DPM.


Inference: We now show that inference on EDPMs 
(Eqn.|^ can be quite efficient. Specifically, inference on a 
star-structured EDPM is no more expensive than a EPM 
built from the same training examples. Recall that EPMs 
can be efficiently optimized with a discrete enumeration 
of N rigid templates with “intelligent caching” of part 
scores. Intuitively, one computes a response map for each 
part, and then scores a rigid template by looking up 
shifted locations in the response maps. EDPMs operate 
in a similar same manner, but one convolves a “min- 
filter” with each response map before looking up shifted 
locations. To be precise, we explicitly write out the 
message-passing equations for a star-structured EDPM 
below, where we assume part i = 1 is the root without 
loss of generality: 


Sedpm{I) = ai-(t){I,zi)+y^mj{zi+a'^A (10) 

zi,m L ^^ . 

j>l 


= iiiax aj • 0(/, Zj) + pij • ^{zi — Zj) (11) 


The maximization in Eqn. (11) need only be performed
once across mixtures, and can be computed efficiently
with a single min-convolution or distance transform
(Felzenszwalb and Huttenlocher 2012). The resulting
message is then shifted by mixture-specific anchor
positions in Eqn. (10). Such mixture-independent
messages can be computed only for leaf parts, because
internal parts in a tree will receive mixture-specific
messages from downstream children. Hence star EDPMs
are essentially no more expensive than an EPM (because
a single min-convolution per part adds a negligible
amount of computation). In our experiments, running
a 2000-mixture EDPM is almost as fast as a standard
6-mixture DPM. Other topologies beyond stars might
provide greater flexibility. However, since EDPMs encode
shape non-parametrically using many mixtures,
each individual mixture need not deform too much,
making a star-structured deformation model a reasonable
approximation (Fig. ).
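
The message-passing scheme above can be illustrated with a minimal 1-D sketch. It is not the authors' implementation: it assumes a quadratic deformation penalty w * (z_1 - z_j)^2, uses a brute-force maximization in place of a linear-time distance transform, and treats out-of-range lookups as impossible placements. The names `message`, `score_edpm`, `score_naive`, and the layout of `anchors` are hypothetical.

```python
import math

NEG_INF = -math.inf


def message(resp, w):
    """m_j(z1) = max over zj of resp[zj] - w * (z1 - zj)**2.

    Brute force, O(L^2); a real system would use a distance transform.
    """
    L = len(resp)
    return [max(resp[zj] - w * (z1 - zj) ** 2 for zj in range(L))
            for z1 in range(L)]


def score_edpm(root_resp, part_resps, weights, anchors):
    """Best score over root location z1 and mixture m.

    anchors[m][j] is the (assumed) offset of part j in mixture m.
    Each message is computed once and shared by all mixtures.
    """
    L = len(root_resp)
    msgs = [message(r, w) for r, w in zip(part_resps, weights)]  # cached
    best = NEG_INF
    for offs in anchors:                    # enumerate mixtures
        for z1 in range(L):                 # enumerate root locations
            total = root_resp[z1]
            for msg, a in zip(msgs, offs):  # look up the shifted message
                p = z1 + a
                total += msg[p] if 0 <= p < L else NEG_INF
            best = max(best, total)
    return best


def score_naive(root_resp, part_resps, weights, anchors):
    """Reference version that re-maximizes every part for every mixture."""
    L = len(root_resp)
    best = NEG_INF
    for offs in anchors:
        for z1 in range(L):
            total = root_resp[z1]
            for r, w, a in zip(part_resps, weights, offs):
                p = z1 + a
                if not 0 <= p < L:
                    total = NEG_INF
                    break
                total += max(r[zj] - w * (p - zj) ** 2 for zj in range(L))
            best = max(best, total)
    return best
```

Both functions return the same maximum; the difference is that `score_edpm` performs the inner maximization once per part, so adding mixtures only adds table lookups.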


4 Experiments 

Armed with our array of non-parametric mixture models
and datasets, we now present an extensive diagnostic
analysis on 11 PASCAL categories from the 2010 PASCAL
trainval set and faces from the Annotated Faces in
the Wild test set (Zhu and Ramanan, 2012). For each
category, we train the model with varying numbers of
samples (N) and mixtures (K). To train our independent
mixtures, we learn rigid HOG templates (Dalal and
Triggs 2005) with linear SVMs (Chang and Lin 2011).
We calibrated SVM scores using Platt scaling (Platt
1999). Since the goal is to calibrate scores of mixture
components relative to each other, we found it sufficient
to train scaling parameters using the original training
set rather than using a held-out validation set. To train
our compositional mixtures, we use a locally-modified
variant of the codebase from (Felzenszwalb et al 2010).
To show the uncertainty of the performance with re¬ 
spect to different sets of training samples, we randomly 
re-sample the training data 5 times for each N and 
K following the partitioned sampling scheme described 
in Sec. The best regularization parameter C for the 
SVM was selected by cross validation. For diagnostic 
analysis, we first focus on faces and buses. 
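
As a rough illustration of the score calibration step, the following sketch fits a Platt-style sigmoid to raw classifier scores. It is a simplified assumption-laden version: plain gradient descent on the negative log-likelihood, 0/1 labels without Platt's target smoothing, and hypothetical names (`fit_platt`, `platt_prob`). It is not the calibration code used in the experiments.

```python
import math


def platt_prob(s, A, B):
    """Calibrated probability P(y=1 | s) = 1 / (1 + exp(A*s + B))."""
    return 1.0 / (1.0 + math.exp(A * s + B))


def fit_platt(scores, labels, lr=0.1, iters=2000):
    """Fit A, B by gradient descent on the negative log-likelihood.

    labels are 0/1. With f = A*s + B, d(NLL)/df = y - p, which gives
    the two gradient terms below.
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = platt_prob(s, A, B)
            gA += (y - p) * s
            gB += (y - p)
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B
```

Fitting one (A, B) pair per mixture component maps each component's scores to a common probability scale, so that the maximum over components compares like with like.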

Evaluation: We adopt the PASCAL VOC precision- 
recall protocol for object detection (requiring 50% over¬ 
lap), and report average precision (AP). While learning 
theory often focuses on analyzing 0-1 classification error 
rather than AP (McAllester 1999), we experimentally 
verified that AP typically tracks 0-1 classification error 
and so focus on the former in our experiments. 
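
The evaluation quantities can be sketched as follows. This is a minimal illustration under stated assumptions, not the official PASCAL VOC evaluation code: boxes are continuous (x1, y1, x2, y2) coordinates, AP is computed as the area under the precision-recall curve after making precision monotonically non-increasing (one common variant), and the matching of detections to ground truth (including duplicate handling) is assumed to have been done already. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve of ranked detections."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

A detection would be marked as a true positive when its `iou` with an unmatched ground-truth box meets the 50% overlap requirement.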


4.1 The importance of proper regularization 


We begin with a rather simple experiment: how does
a single rigid HOG template tuned for faces perform
when we give it more training data N? Fig. 8a shows
the surprising result that additional training data can
decrease performance! For imbalanced object detection
datasets with many more negatives than positives, the
hinge loss appears to grow linearly with the amount
of positive training data; if one doubles the number of
positives, the total hinge loss also doubles. With C held
fixed, the regularization term then carries less relative
weight as N grows, which leads to overfitting. To address
this problem, we found it crucial
to cross-validate C across different N. By doing so, we
do see better performance with more data (Fig. 8).
While cross-validating regularization parameters is a
standard procedure when applying a classifier to a new
dataset, most off-the-shelf detectors are trained using
a fixed C across object categories with large variations
in the number of positives. We suspect other systems
based on standard detectors (Felzenszwalb et al, 2010;
Dalal and Triggs 2005) may also be suffering from
suboptimal regularization and might show an improvement
with proper cross-validation.
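
The scaling argument can be checked numerically with a small sketch. It assumes the standard soft-margin objective 0.5 * ||w||^2 + C * (sum of hinge losses) and idealizes "more data from the same distribution" as an exact duplication of the training set; the data values and function names are made up for illustration.

```python
def hinge_sum(w, xs, ys):
    """Total hinge loss sum_i max(0, 1 - y_i * <w, x_i>)."""
    return sum(
        max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))
        for x, y in zip(xs, ys)
    )


def svm_objective(w, xs, ys, C):
    """0.5 * ||w||^2 + C * total hinge loss."""
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * hinge_sum(w, xs, ys)


# Illustrative (made-up) data: labels are +1 / -1.
xs = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
ys = [1, -1, 1]
w = [0.2, -0.1]

doubled_xs, doubled_ys = xs + xs, ys + ys
# Doubling the data doubles the loss term for any fixed w ...
ratio = hinge_sum(w, doubled_xs, doubled_ys) / hinge_sum(w, xs, ys)
# ... so halving C recovers the original objective.
same = svm_objective(w, doubled_xs, doubled_ys, C=0.5)
orig = svm_objective(w, xs, ys, C=1.0)
```

Under this idealization, keeping C fixed while doubling N is equivalent to doubling C on the original set, which is one way to read why C must be re-selected for each N.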


[Fig. 8 panels: (a) Single face template (test); (b) Single face template (train); (c) Single face template (test)]

Fig. 8 (a) More training data could hurt if we did not
cross-validate to select the optimal C. (b) Training error,
when measured on a fixed training set of 900 faces and 1218
negative images, always decreases as we train with more of
those images. This further suggests that overfitting is the
culprit, and that proper regularization is the solution. (c)
Test performance can change drastically with C. Importantly,
the optimal setting of C depends on the number of positive
training examples N.


4.2 The importance of clean training data 

Although proper regularization parameters proved to
be crucial, we still discovered scenarios where additional
training data hurt performance. Fig. 9 shows an experiment
with a fixed set of N training examples where we
train two detectors: (1) All is trained with all N
examples, while (2) Frontal is trained with a smaller,
“clean” subset of examples containing frontal faces. We
cross-validate C for each model for each N. Surprisingly,
Frontal outperforms All even though it is trained with
less data.

This outcome cannot be explained by a failure of
the model to generalize from training to test data. We
examined the training loss for both models, evaluated on
the full training set. As expected, All has a lower SVM
objective function than Frontal (1.29 vs 3.48). But in
terms of 0-1 loss, All makes nearly twice as many
classification errors on the same training images (900 vs 470).
This observation suggests that the hinge loss is a poor
surrogate for the 0-1 loss because “noisy” hard examples
can wildly distort the decision boundary as they incur
a large, unbounded hinge penalty. Interestingly, latent
mixture models can mimic the behavior of non-convex
bounded loss functions (Wu and Liu 2007) by placing
noisy examples into junk clusters that simply serve to
explain outliers in the training set. In some cases, a single
“clean” mixture component by itself explains most
of the test performance (Fig. 10).

The importance of “clean” training data suggests it
could be fruitful to correctly cluster training data into
mixture components where each component is “clean”.
We evaluated the effectiveness of providing fully supervised
clustering in producing clean mixtures. In Fig. 11
we see a small 2% to 5% increase for manual clustering.
In general, we find that unsupervised clustering
can work reasonably well but depends strongly on the
category and features used. For example, the DPM
implementation of (Felzenszwalb et al, 2010) initializes
mixtures based on aspect ratios. Since faces in different
viewpoints share similar aspect ratios, this tends to
produce “unclean” mixtures compared to our non-latent
clustering.


[Fig. 9 panel (a); x-axis: Number of training samples]

Fig. 9 In (a), we compare the performance of a single HOG
template trained with N multi-view face examples, versus a
template trained with a subset of those N examples
corresponding to frontal faces. The frontal-face template (b) looks
“cleaner” and makes fewer classification errors on both testing
and training data. The fully-trained template (c) looks noisy
and performs worse, even though it produces a lower SVM
objective value (when both (b) and (c) are evaluated on the
full training set). This suggests that SVMs are sensitive to
noise and benefit from training with “clean” data.


Fig. 10 The single bicycle template (marked with red) alone
achieves AP=29.4%, which is almost equivalent to the
performance of using all 8 mixtures (AP=29.7%). Both models
strongly outperform a single-mixture model trained on the full
training set. This suggests that these additional mixtures are
useful during training to capture outliers and prevent “noisy”
data from polluting a “clean” template that does most of the
work at test time.


[Fig. 11 panels: Face; Bus; x-axis: Num. of training samples]

Fig. 11 We compare the human clustering and automatic
k-means clustering at near-identical K. We find that supervised
clustering provides a small but noticeable improvement of
2-5%.


4.3 Performance of independent mixtures


Given the right regularization and clean mixtures
trained independently, we now evaluate whether
performance asymptotes as the amount of training data and
the model complexity increase.

Fig. 12 shows performance as we vary K and N
after cross-validating C and using supervised clustering.
Fig. 12a demonstrates that increasing the amount of
training data yields a clear improvement in performance
at the beginning, and the gain quickly becomes smaller
later. Larger models with more mixtures tend to perform
worse with fewer examples due to overfitting, but
eventually win with more data. Surprisingly, improvement
tends to saturate at ~100 training examples per
mixture and with ~10 mixtures. Fig. 12b shows performance
as we vary model complexity for a fixed amount
of training data. Particularly at small data regimes, we
see the critical point one would expect from Fig. : a
more complex model performs better up to a point, after
which it overfits. We found similar behavior for the buses
category, which we manually clustered by viewpoint.


[Fig. 12 panels: (a) Face (AP vs N), x-axis: Number of training data; (b) Face (AP vs K), x-axis: Number of mixtures]

Fig. 12 (a)(c) show the monotonic non-decreasing curves
when we add more training data. The performance saturates
quickly at a few hundred training samples. (b)(d) show how
the performance changes with more mixtures K. Given a fixed
number of training samples N, the performance increases at
the beginning, and decreases when we split the training data
too much so that each mixture only has few samples.


We performed similar experiments for all 11 PASCAL
object categories in our PASCAL-10X dataset
shown in Fig. . We evaluate performance on the
PASCAL 2010 trainval set since the test set annotations
are not public. We cluster the training data into
K=[1, 2, 4, 8, 16] mixture components, and use N=[50, 100,
500, 1000, 3000, N_max] training samples, where N_max is
the number of training samples collected for the given
category. For each N, we select the best C and K
through cross-validation. Fig. 13a appears to suggest
that performance is saturating across all categories as
we increase the amount of training data. However, if we
plot performance on a log scale (Fig. 13b), it appears
to increase roughly linearly. This suggests that the required
training data may need to grow exponentially to
produce a fixed improvement in accuracy. For example,
if we extrapolate the steepest curve in Fig. 13b (motorbike),
we will need 10^^ motorbike samples to reach
95% AP!
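
The kind of extrapolation described here can be sketched as a least-squares fit of AP against log10(N). The data points below are invented for illustration and are not measurements from the experiments; `fit_log_linear` and `samples_needed` are hypothetical helper names.

```python
import math


def fit_log_linear(ns, aps):
    """Least-squares fit of AP = a + b * log10(N); returns (a, b)."""
    xs = [math.log10(n) for n in ns]
    mx = sum(xs) / len(xs)
    my = sum(aps) / len(aps)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, aps))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b


def samples_needed(a, b, target_ap):
    """N at which the fitted line reaches target_ap."""
    return 10 ** ((target_ap - a) / b)


# Made-up curve: AP rises by 0.05 for every tenfold increase in N.
a, b = fit_log_linear([100, 1000, 10000], [0.40, 0.45, 0.50])
n_required = samples_needed(a, b, 0.95)
```

With a slope of b per decade, every fixed AP increment costs a constant multiplicative factor of data, which is why the required N grows exponentially with the target accuracy.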


Of course 95% AP may not be an achievable level
of performance. There is some upper bound imposed by
the Bayes risk associated with the HOG feature space
which no amount of training data will let us surpass. Are
classic mixtures of rigid templates approaching the Bayes
optimal performance? Of course we cannot compute the
Bayes risk, so this is hard to answer in general. However,
the performance of any system operating on the same
data and feature space provides a lower bound on the
optimal performance. We next analyze the performance
of compositional mixtures to provide a much better lower
bound on optimal performance.


4.4 Performance of compositional mixtures 


We now perform a detailed analysis of compositional
mixture models, including DPMs, EPMs, and EDPMs.
We focus on face detection and PASCAL buses. We consider
the latent star-structured DPM of (Felzenszwalb
et al 2010) as our primary baseline. For face detection,
we also compare to the supervised tree-structured DPM
of (Zhu and Ramanan, 2012), which uses facial landmark
annotations in training images as supervised part
locations. Each of these DPMs makes use of different
parts, and so can be used to define different EPMs and
EDPMs. We plot performance for faces in Fig. 15 and
buses in Fig. 16.

Supervised DPMs: For face detection, we first
note that a supervised DPM can perform quite well (91%
AP) with fewer than 200 example faces. This represents a
lower bound on the maximum achievable performance
with a mixture of linear templates given a fixed training
set. This performance is noticeably higher than that of
our cross-validated rigid mixture model, which maxes
out at an AP of 76% with 900 training examples. By
extrapolation, we predict that one would need N = 10^^
training examples to achieve the DPM performance. To
analyze where this performance gap is coming from, we
now evaluate the performance of various compositional
mixture models.

Latent parts: We begin by analyzing the performance
of compositional mixtures defined by latent parts,
as they can be constructed for both faces and PASCAL
buses. Recall that EPMs have the benefit of sharing
parameters between rigid templates, but they cannot
extrapolate to new shape configurations not seen among
the N training examples. EPMs noticeably improve
performance over independent mixtures, improving AP
from 76% to 78.5% for faces and improving AP from
56% to 64% for buses. In fact, for large N, they approach
the performance of latent DPMs, which is 79% for faces
and 63% for buses. For small N, EPMs somewhat
underperform DPMs. This makes sense: with very few
observed shape configurations, exemplar-based methods
are limited. But interestingly, with a modest number
of observed shapes (~1000), exemplar-based methods
with parameter sharing can approach the performance
of DPMs. This in turn suggests that extrapolation to
unseen shapes may not be crucial, at least in the
latent case. This is further evidenced by the fact that
EDPMs, the deformable counterpart to EPMs, perform
similarly to both EPMs and DPMs.

Fig. 13 We plot the best performance for varying amounts of
training data for 11 PASCAL categories on the PASCAL 2010
trainval set. (a) shows that all the curves look saturated with
a relatively small amount of training data; but the log scale
in (b) suggests diminishing returns instead of true saturation.
However, the performance increases so slowly that we will need
more than 10^^ examples per category to reach 95% AP if we
keep growing at the same rate.

Supervised parts: The story changes somewhat for
supervised parts. Here, supervised EPMs outperform
independent mixtures, 85% to 76%. Perhaps surprisingly,
EPMs even outperform latent DPMs. However, supervised
EPMs still underperform a supervised DPM. This
suggests that, in the supervised case, the performance
gap (85% vs 91%) stems from the ability of DPMs to
synthesize configurations that are not seen during training.
Moreover, the reduction in relative error due to
extrapolation is more significant than the reduction due
to part sharing. Zhu and Ramanan (2012) point out
that a tree-structured DPM significantly outperforms
a star-structured DPM, even when both are trained
with the same supervised parts. One argument is that
trees better capture natural spatial constraints of the
model, such as the contour-like continuity of small parts.
Indeed, we also find that a star-structured DPM does a
“poorer” job of extrapolation. In fact, we show that an
EDPM does as well as a supervised star model, but not
quite up to the performance of a tree DPM.

Analysis: Our results suggest that part models can
be seen as a mechanism for performing intelligent parameter
sharing across observed mixture components
and extrapolation to implicit, unseen mixture components.
Both these aspects contribute to the strong performance
of DPMs. However, with the “right” set of
(supervised) parts and the “right” geometric (tree-structured)
constraints, extrapolation to unseen templates
has the potential to be much more significant. We see
this as a consequence of the “long-tail” distribution
of object shape (Fig. 17): many object instances can
be modeled with a few shape configurations, but there
exists a long tail of unusual shapes. Examples from
the long tail may be difficult to observe in any finite
training dataset, suggesting that extrapolation is crucial
for recognizing these cases. Once the representation for
sharing and extrapolation is accurately specified, fairly
little training data is needed. Indeed, our analysis shows
that one can train a state-of-the-art face detector (Zhu
and Ramanan 2012) with 50 face images.


Relation to Exemplar SVMs: In the setting of
object detection, we were not able to see significant
performance improvements due to our non-parametric
compositional mixtures. However, EDPMs may be useful
for other tasks. Specifically, they share an attractive
property of exemplar SVMs (ESVMs) (Malisiewicz et al
2011): each detection can be affiliated with its closest
matching training example (given by the mixture index),
allowing us to transfer annotations from a training
example to the test instance. Malisiewicz et al (2011)
argue that non-parametric label transfer is an effective
way of transferring associative knowledge, such as 3D
pose, segmentation masks, attribute labels, etc. However,
unlike ESVMs, EDPMs share computation among
the exemplars (and so are faster), can generalize to unseen
configurations (since they can extrapolate to new
shapes), and also report a part deformation field associated
with each detection (which may be useful to warp
training labels to better match the detected instance).
We show example detections (and their matching exemplars)
in Fig. 14.


Fig. 14 We visualize detections using our exemplar DPM
(EDPM) model. As opposed to existing exemplar-based methods
(Malisiewicz et al 2011), our model shares parameters
between exemplars (and so is faster to evaluate) and can
generalize to unseen shape configurations. Moreover, EDPMs
return corresponding landmarks between an exemplar and a
detected instance (and hence an associated set of landmark
deformation vectors), visualized on the top row of faces.


[Fig. 15 panel: Faces]

Fig. 15 We compare the performance of mixture models
with EPMs and latent/supervised DPMs for the task of face
detection. A single rigid template (K = 1) tuned for frontal
faces outperforms the one tuned for all faces (as shown in
Fig. 9). Mixture models boost performance to 76%, approaching
the performance of a latent DPM (79%). The EPM shares
supervised part parameters across rigid templates, boosting
performance to 85%. The supervised DPM (91%) shares parameters
but also implicitly scores additional templates not
seen during training.


Fig. 16 We compare the performance of mixture models with
latent EPMs, EDPMs, and DPMs for bus detection. In the
latent setting, EPMs significantly outperform the rigid mixtures
of templates and match the performance of the standard
latent DPMs.


5 Related Work

We view our study as complementary to other meta-analyses
of the object recognition problem, such as studies
of the dependence of performance on the number of
object categories (Deng et al, 2010), visual properties
(Hoiem et al, 2012), dataset collection bias (Torralba
and Efros, 2011), and component-specific analysis of
recognition pipelines (Parikh and Zitnick 2011).

Object detection: Our analysis is focused on
template-based approaches to recognition, as such methods
are currently competitive on challenging recognition
problems such as PASCAL. However, it behooves us to
recognize the large body of alternate approaches including
hierarchical or “deep” feature learning (Krizhevsky
et al 2012), local feature analysis (Tuytelaars and
Mikolajczyk, 2008), kernel methods (Vedaldi et al, 2009),
and decision trees (Bosch et al, 2007), to name a few.
Such methods may exhibit a different dependence of
performance on dataset size due to inherent
differences in model architectures. We hypothesize
that our conclusions regarding parameter sharing and
extrapolation may generally hold for other architectures.


[Fig. 17 panels: (a) Bus; (b) Face]

Fig. 17 We plot the number of distinct shape patterns in
our training set of buses and faces. Each training example is
“binned” into a discrete shape by quantizing a vector of part
locations. The above histograms count the number of examples
that fall into a particular shape bin. In both cases, the number
of occurrences seems to follow a long-tail distribution: a small
number of patterns are common, while there are a huge number
of rare cases. Interestingly, there are fewer than 500 unique
bus configurations observed in our PASCAL-10X dataset of
2000 training examples. This suggests that one can build
an exemplar part model (EPM) from the “right” set of 500
training examples and still perform similarly to a DPM trained
on the full dataset (Fig. 16).
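
The shape-binning procedure described in the caption of Fig. 17 can be sketched as follows. The details are assumptions made for illustration: part locations are taken relative to a root location and quantized on a grid of `cell` pixels, and the names `shape_bin` and `shape_histogram` are hypothetical. The exact quantization used to produce the figure is not specified here.

```python
import math
from collections import Counter


def shape_bin(root, part_locations, cell=8):
    """Quantize part offsets (relative to the root) into a hashable key."""
    rx, ry = root
    return tuple(
        (math.floor((x - rx) / cell), math.floor((y - ry) / cell))
        for x, y in part_locations
    )


def shape_histogram(examples, cell=8):
    """Count examples per shape bin, most common bin first.

    Each example is a pair (root_xy, [part_xy, ...]).
    """
    counts = Counter(shape_bin(root, parts, cell) for root, parts in examples)
    return counts.most_common()
```

Sorting the bins by count gives the kind of histogram in which a few shapes account for most examples while many bins contain a single example; the number of non-empty bins is the number of distinct rigid templates an exemplar part model would need.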

Non-parametric models in vision: Most relevant
to our analysis is work on data-driven models for recognition.
Non-parametric scene models have been used
for scene completion (Hays and Efros, 2007) and geolocation
(Hays and Efros, 2008). Exemplar-based methods
have also been used for scene-labeling through label
transfer (Liu et al, 2011; Tighe and Lazebnik, 2010).
Other examples include nearest-neighbor methods for
low-resolution image analysis (Torralba et al 2008) and
image classification (Zhang et al, 2006; Boiman et al,
2008). The closest approach to ours is (Malisiewicz et al
2011), who learn exemplar templates for object detection.
Our analysis suggests that it is crucial to share information
between exemplars and extrapolate to unseen
templates by re-composing parts to new configurations.


Scalable nearest-neighbors: We demonstrate
that compositional part models are one method for
efficient nearest-neighbor computations. Prior work
has explored approximate methods such as hashing
(Shakhnarovich et al, 200^, 2005) and kd-trees (Muja
and Lowe, 2009; Beis and Lowe, 1997). Our analysis
shows that one can view parts as tools for exact and
efficient indexing into an exponentially-large set of templates.
This suggests an alternative perspective of parts
as computational entities rather than semantic ones.
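
The indexing view can be made concrete with a small sketch. It assumes a star model in which, once the root is fixed, each part's placement can be chosen independently, and it assumes that `part_scores[i][k]` already holds the score of part i at its k-th allowed placement (appearance plus deformation). The function names are hypothetical.

```python
from itertools import product


def num_templates(part_scores):
    """Number of implicit templates: product of placements per part."""
    n = 1
    for s in part_scores:
        n *= len(s)
    return n


def best_by_enumeration(part_scores):
    """Score every implicit template explicitly (exponential in parts)."""
    choices = product(*(range(len(s)) for s in part_scores))
    return max(
        sum(s[k] for s, k in zip(part_scores, choice)) for choice in choices
    )


def best_by_parts(part_scores):
    """Same maximum with one max per part (linear in parts)."""
    return sum(max(s) for s in part_scores)
```

With P parts and D placements each, `best_by_enumeration` visits D**P templates while `best_by_parts` does P * D work and returns the same exact maximum, which is the sense in which parts index an exponentially large template set.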


6 Conclusion 


We have performed an extensive analysis of the current 
dominant paradigm for object detection using HOG fea¬ 
ture templates. We specifically focused on performance 
as a function of the amount of training data, and intro¬ 
duced several non-parametric models to diagnose the 
state of affairs. 

To scale current systems to larger datasets, we find 
that one must get certain “details” correct. Specifically, 
(a) cross-validation of regularization parameters is mun¬ 
dane but crucial, (b) current discriminative classification 
machinery is overly sensitive to noisy data, suggesting 
that (c) manual cleanup and supervision or more clever 
latent optimization during learning may play an im¬ 
portant role for designing high-performance detection 
systems. We also demonstrate that HOG templates have
a relatively small effective capacity; one can train accurate
HOG templates with 100-200 positive examples
(rather than thousands of examples as is typically done
(Dalal and Triggs 2005)).

From a broader perspective, an emerging idea in
our community is that object detection might be solved 
with simple models backed with massive training sets. 
Our experiments suggest a slightly refined view. Given 
the size of existing datasets, it appears that the cur¬ 
rent state-of-the-art will need significant additional data 
(perhaps exponentially larger sets) to continue produc¬ 
ing consistent improvements in performance. We found 
that larger gains were possible by enforcing richer con¬ 
straints within the model, often through non-parametric 
compositional representations that could make better 
use of additional data. In some sense, we need “better 
models” to make better use of “big data”. 

Another common hypothesis is that we should fo¬ 
cus on developing better features, not better learning 
algorithms. While HOG is certainly limited, we still 
see substantial performance gains without any change 
in the features themselves or the class of discriminant 
functions. Instead, the strategic issues appear to be parameter
sharing, compositionality, and non-parametric
encodings. Establishing and using accurate, clean correspondence
among training examples (e.g., specifying
that certain examples belong to the same sub-category,
or that certain spatial regions correspond to the same
part) and developing non-parametric compositional approaches
that implicitly make use of augmented training
sets appear to be the most promising directions.